(57) ABSTRACT

There is provided a method for reordering and renaming 
memory references in a multiprocessor computer system 
having at least a first and a second processor. The first 
processor has a first private cache and a first buffer, and the 
second processor has a second private cache and a second 
buffer. The method includes the steps of, for each of a 
plurality of gated store requests received by the first pro- 
cessor to store a datum, exclusively acquiring a cache line 
that contains the datum by the first private cache, and storing 
the datum in the first buffer. Upon the first buffer receiving 
a load request from the first processor to load a particular 
datum, the particular datum is provided to the first processor 
from among the data stored in the first buffer based on an 
in-order sequence of load and store operations. Upon the 
first cache receiving a load request from the second cache for 
a given datum, an error condition is indicated and a current 
state of at least one of the processors is reset to an earlier 
state when the load request for the given datum corresponds 
to the data stored in the first buffer. 

25 Claims, 8 Drawing Sheets 
METHODS AND APPARATUS FOR REORDERING AND RENAMING MEMORY
REFERENCES IN A MULTIPROCESSOR COMPUTER SYSTEM

BACKGROUND

1. Technical Field
The present invention generally relates to computer processing
systems and, in particular, to methods and apparatus for
reordering and renaming memory references in a multiprocessor
computer system.

2. Background Description
Contemporary high-performance processors rely on superscalar,
superpipelining, and/or very long instruction word (VLIW)
techniques for exploiting instruction-level parallelism in
programs (i.e., for executing more than one instruction at a
time). In general, these processors contain multiple functional
units, execute a sequential stream of instructions, are able to
fetch from memory more than one instruction per cycle, and are
able to dispatch for execution more than one instruction per
cycle subject to dependencies and availability of resources.

The pool of instructions from which the processor selects those
that are dispatched at a given point in time is enlarged by the
use of out-of-order execution. Out-of-order execution is a
technique by which the operations in a sequential stream of
instructions are reordered so that operations appearing later
are executed earlier, if the resources required by the later
appearing operations are free. Thus, out-of-order execution
reduces the overall execution time of a program by exploiting
the availability of the multiple functional units and using
resources that would otherwise be idle. Reordering the execution
of operations requires reordering the results produced by those
operations, so that the functional behavior of the program is
the same as what would be obtained if the instructions were
executed in their original sequential order.

In the case of memory-related operations, a memory store
operation stores a datum in memory. A later memory load
operation may read this datum from memory, load the datum into a
processor register and, as is frequently the case, start a
sequence of operations that depend on the datum. When directly
bypassing such values from the store operation to a subsequent
load operation, a slow main memory access may be substituted by
a faster register-to-register access. In addition to using idle
resources, the bypassing of such values reduces the critical
path (i.e., the sequence of operations which determines the
minimum execution time possible for a given code fragment) and
reduces the number of memory operations which must be processed
by the memory system. An additional performance improvement can
be achieved by speculatively executing store operations
out-of-order. Other benefits are the ability to reorder multiple
store and load references to the same memory location by using a
technique referred to as "renaming of memory locations".

In general, there are two basic approaches to implementing
out-of-order execution and reordering of results: dynamic
reordering and static reordering. In dynamic reordering, the
instructions are analyzed at execution time, and the
instructions and results are reordered in hardware. In static
reordering, a compiler/programmer analyzes and reorders the
instructions and the results produced by those instructions when
the program is generated, thus the reordering tasks are
accomplished through software. These two approaches can be
jointly implemented.

To ensure that such operations are performed correctly, there
must exist a mechanism to undo speculative memory references.
Furthermore, store operations to the same address must be
presented to the main memory in original program order.

In a multiprocessor environment, additional restrictions are
posed on the ordering of memory operations. To achieve
predictable and repeatable computation of programs in a
multiprocessor environment, a requirement of "sequential
consistency" is described in the article by L. Lamport, "How to
Make a Multiprocessor that Correctly Executes Multiprocess
Programs", IEEE Transactions on Computers, C-28(9), pp. 690-91
(September 1979). The article by Lamport defines a
multiprocessor system as sequentially consistent if "the result
of any execution is the same as if the operations of all
processors were executed in some sequential order, and the
operations of each individual processor appear in this sequence
in the order specified by its program". For static speculative
execution, the order of the original logical program text is
authoritative, not the reordered program text, and the compiler
and hardware implementation must collaborate to generate an
execution equivalent to that original order.

To achieve proper performance while simplifying coherence
protocols between multiple processors in a computer system,
several relaxations of the above described strictly sequential
consistent order are possible. The types of ordering which are
allowable depend on the memory consistency model guaranteed by a
particular implementation. An overview of current and proposed
consistency models and their characteristics is provided in the
article by S. Adve and K. Gharachorloo, "Shared Memory
Consistency Models: A Tutorial", Technical Report 9512, Dept. of
Electrical and Computer Engineering, Rice University, Houston,
Tex. (September 1995).

Typically, these requirements impose additional restrictions on
the order of multiple store operations (even those store
operations referring to different addresses), and of load and
store operations (executed by the same processor or different
processors, and referring to the same address or different
addresses) with respect to each other.

While these requirements guarantee the correct operation of
programs designed to work in the context of such coherency
protocols, they impose limitations on the order of operation as
instructions are executed in the processor. To achieve high
performance while adhering to processor consistency models, a
processor must be able to reorder memory operations internally
and bypass results between them, but present the memory
operations to the memory system in-order.

Accordingly, it would be desirable and highly advantageous to
have support for the following features in a high performance
memory interface of a multiprocessor computer system
implementing out-of-order execution, so as to provide maximum
scheduling freedom:

1. The ability to execute store operations out-of-order, but
retire them to memory in-order.

2. The ability to speculatively perform store operations,
coupled with the ability to undo such store operations
transparently (i.e., without influencing the correctness of
program execution in a multiprocessor system).

3. The ability to hold multiple store result values for the same
memory address, and resolve load references to these values,
while at the same time retiring store values in original program
order to the memory system.

Some example code sequences will now be given to illustrate the
performance impact of implementing the above features in a
processor supporting out-of-order execution of store operations.
With respect to the ability to execute store operations
out-of-order with respect to each other, consider the following
in-order code fragment, where the latency of a multiply (MUL)
operation is three cycles, and that of all other operations is
1 cycle.

    MUL   r3 = r4, r5
    ST    r3, 16(fp)
    LI    r4 = 0
    ST    r4, 20(fp)

The preceding code fragment will execute on a single issue
out-of-order processor without out-of-order execution of store
operations in 5 cycles as follows:

    MUL   r3 = r4, r5
    LI    r4 = 0
    NOP
    ST    r3, 16(fp)
    ST    r4, 20(fp)

With the capability to perform store operations out-of-order,
the processor requires only four cycles using the following
schedule:

    MUL   r3 = r4, r5
    LI    r4 = 0
    ST    r4, 20(fp)
    ST    r3, 16(fp)

With respect to control-speculative execution of store
operations, consider the following code fragment:

    CMPI  cr0, r4, 0
    BTRUE cr0.eq, label
    ST    r4, 12(fp)
label:

If we assume that the branch is predicted to not be taken most
of the time, and branch resolution requires 3 cycles from a
compare operation to a branch operation using the condition,
then the above code requires 5 cycles to execute even if the
branch is correctly predicted to not be taken. This is because
the store operation cannot be executed speculatively once the
branch has been predicted as not taken, since store operations
cannot be undone.

In contrast, the store operation could be performed
speculatively in a memory system supporting the ability to undo
store operations transparently. In such a case, if the branch is
predicted correctly, then the above code fragment can execute in
3 cycles on a single issue out-of-order processor.

Finally, to execute store operations to the same address
out-of-order and correctly resolve references, consider the
following code fragment:

    DIV   r3 = r4, r5
    ST    r3, 12(r8)
    L     r4, 12(r8)
    LI    r5, 0
    ST    r5, 12(r8)
    L     r6, 12(r8)
    AI    r6, 1

Similar sequences of load and store operations routinely occur
in the presence of function calls, where frequently retrieved
parameters corresponding thereto are stored on the stack. For an
article describing the serializing effects of stack references,
see "The Limits of Instruction Level Parallelism in SPEC95
Applications", Postiff et al., International Conference on
Architectural Support for Programming Languages and Operating
Systems (ASPLOS-VIII) Workshop on Interaction between Compilers
and Computer Architectures (INTERACT-3), October 1998.

Consider the execution of the preceding code fragment with a
latency of 5 cycles for the divide instruction (DIV) on a single
issue out-of-order processor without the capability to rename
store memory references to the same address. The resulting
schedule will require 10 cycles to execute:

    DIV   r3 = r4, r5
    LI    r5, 0
    NOP
    NOP
    NOP
    ST    r3, 12(r8)
    L     r4, 12(r8)
    ST    r5, 12(r8)
    L     r6, 12(r8)
    AI    r6, 1

With renaming of memory locations and in-order retirement of
stored values, the code can be executed in 7 cycles. In the
following schedule, multiple live names are indicated in square
brackets which denote the order of retirement to the memory
subsystem.

    DIV   r3 = r4, r5
    LI    r5, 0
    ST    r5, 12(r8) [2]
    L     r6, 12(r8) [2]
    AI    r6, 1
    ST    r3, 12(r8) [1]
    L     r4, 12(r8) [1]

The examples above are not atypical; the necessity to perform
store operations in-order with respect to both branches and
other store operations degrades performance fairly severely by
forcing the sequential execution of operations that could
otherwise be executed in parallel. However, such a serialization
can be avoided (that is, the load operation can be performed
earlier than the store operation) as long as actual processor
execution can be decoupled from the sequence of data values
presented to main memory and other processors in a
multiprocessor environment. Thus, some store operations are
performed earlier than other store operations, and speculatively
with respect to unresolved branches; load operations can
reference such values out-of-order. Moreover, if load references
are renamed correctly with respect to multiple store operations
to the same address, any operation that depends on the datum
loaded out-of-order can also be performed out-of-order.

A brief description of the operation of memory requests in a
multiprocessor system will now be given with reference to
FIGS. 1-3. FIG. 1 is a block diagram illustrating a simplified
multiprocessor computing system 100 with private caches, 
to which the present invention may be applied. FIG. 2 is a 
flow diagram illustrating the actions taken upon receiving a 
memory access request in a multiprocessor environment 
having a memory hierarchy with private caches, according 
to the prior art. FIG. 3 is a flow diagram illustrating the 
actions taken by a processor 106 upon receiving a cache 
cross-interrogate request from a processor 102, according to 
the prior art. It is to be appreciated that the method of FIG. 
3 is performed by processor 106 in response to step 224 of 
FIG. 2. 

Referring to FIG. 1, multiprocessor computing system 
100 includes a central processing unit (hereinafter "proces- 
sor" or "CPU") 102 operatively coupled to a private cache 
104, and a processor 106 operatively coupled to a private 
cache 108. 

A private cache is a cache which services memory requests 
of a single processor. A shared cache services the memory 
requests of multiple processors. Caches can also be shared 
with respect to some processors, but private with respect to 
one or more other processors. 

The processor 102 and private cache 104 comprise a node 
103 of system 100. The processor 106 and private cache 108 
comprise a node 107 of system 100. 

It is to be appreciated that more than two nodes may be 
present in a multiprocessor computer system. Moreover, 
while private cache 104 and private cache 108 each imply a 
single cache unit, they are intended to also include a cache 
hierarchy that includes a plurality of caches (e.g., cache 
hierarchy 104 and cache hierarchy 108). Nonetheless, a 
single cache is implied in the remainder of this document for 
ease of understanding. It is to be further appreciated that a 
cache may be a shared cache with respect to some 
processors, and a private cache with respect to other pro- 
cessors. Given the teachings of the present invention pro- 
vided herein, one of ordinary skill in the related art will be 
able to contemplate these and similar implementations of the 
elements of the invention. 

Node 103, node 107, a main memory 110, a main memory 
112, and input/output (I/O) devices 114 are all operatively 
coupled to each other through system bus 116. I/O 114 
collectively refers to I/O adapters (e.g., video card) and the 
I/O devices (e.g., monitor) operatively coupled thereto. 
Main memory 110 and main memory 112 are shared 
between node 103 and node 107. It is to be appreciated that 
more elaborate interconnection structures may be employed 
in place of system bus 116. 

Referring to FIG. 2, processor 106 maintains control of a 
plurality of memory locations. Processor 102 accesses one 
of the plurality of memory locations according to the method 
of FIG. 2. 

The cache controller of processor 102 receives a request 
for data from processor 102 (step 210). It is then determined 
whether the request refers to a memory location in private 
cache 104 of processor 102 (step 212). If the request does 
not refer to a memory location in private cache 104 of 
processor 102, then the method proceeds to step 220. 

However, if the request refers to a location in private 
cache 104 of processor 102, then it is determined whether 
the entry in private cache 104 corresponding to the location 
has the required permissions (e.g., if a write request has been 
issued, whether the cache entry is in exclusive ownership 
mode which allows the write request to proceed, or in shared 
ownership mode, which only allows read requests to be 
processed) (step 214). If the entry does not have the required 
permissions, then the method proceeds to step 220. 
However, if the entry has the required permissions, then the 
request is satisfied from private cache 104 (step 218), and 
the method is terminated. 

At step 220, it is determined, via a cross-interrogate from 
processor 102 to processor 106, whether the location is 
resident in private cache 108 of processor 106. If the 
location is not resident in private cache 108 of processor 
106, then the method proceeds to step 232. However, if the 
location is resident in private cache 108 of processor 106, 
then the location is requested with the appropriate permis- 
sions (via a cross-interrogate), and the method proceeds to 
step 234 (step 224). 

At step 232, the memory location is fetched from main 
memory, and the method proceeds to step 234. At step 234, 
the memory location (which was either fetched at step 232 
or received (in response to the cross-interrogate) at step 224) 
is stored in private cache 104, the request is satisfied, and the 
method is terminated. 

Referring to FIG. 3, when processor 106 receives a 
cross-interrogate request from processor 102 (step 310), processor 
106 searches a private cache directory to determine whether 
the request refers to a location in private cache 108 of 
processor 106 (step 312). If the request does not refer to a 
location in private cache 108 of processor 106, then the 
method is terminated (step 314). However, if the request 
refers to a location in private cache 108 of processor 106, 
then private cache 108 returns the location to the requesting 
processor (i.e., processor 102) and the method is terminated 
(step 316). 
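
The request flow of FIG. 2 and the cross-interrogate response of FIG. 3 can be sketched together in a few lines of Python. This is only an illustrative model under stated assumptions: the names `PrivateCache`, `access`, and `cross_interrogate` are hypothetical, main memory is a plain dictionary, and the way the responding cache downgrades or gives up its copy is an assumption, since the text above does not specify it.

```python
from dataclasses import dataclass, field

EXCLUSIVE, SHARED = "exclusive", "shared"


@dataclass
class PrivateCache:
    # address -> (value, ownership mode); a stand-in for cache 104 or 108
    lines: dict = field(default_factory=dict)

    def cross_interrogate(self, addr, mode):
        """FIG. 3: answer a cross-interrogate from the other processor."""
        if addr not in self.lines:            # step 312 miss, step 314
            return None
        value, _ = self.lines[addr]           # step 316: return the location
        if mode == EXCLUSIVE:
            del self.lines[addr]              # assumption: give up the copy
        else:
            self.lines[addr] = (value, SHARED)  # assumption: downgrade
        return value


def access(cache, peer, memory, addr, write_value=None):
    """FIG. 2: satisfy one request; returns (value, where it came from)."""
    need = EXCLUSIVE if write_value is not None else SHARED
    entry = cache.lines.get(addr)                         # step 212
    if entry is not None and (need == SHARED or entry[1] == EXCLUSIVE):
        value, mode = entry                               # steps 214, 218
        source = "private cache"
    else:
        value = peer.cross_interrogate(addr, need)        # steps 220, 224
        source = "peer cache"
        if value is None:
            value = memory[addr]                          # step 232
            source = "main memory"
        mode = need
    if write_value is not None:
        value = write_value
    cache.lines[addr] = (value, mode)                     # step 234
    return value, source
```

A read that misses locally is answered by the peer if the peer holds the line and by main memory otherwise; a write to a line held only in shared mode fails the permission check of step 214 and goes back out through the cross-interrogate path.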

3. Problems With the State of the Art 

A description of the prior art and deficiencies associated 
therewith will now be given. For example, an address 
resolution buffer which supports out-of-order execution of 
memory operations and memory renaming is described by 
M. Franklin and G. Sohi, in "ARB: A Hardware Mechanism 
for Dynamic Reordering of Memory References", IEEE 
Transactions on Computers, Vol. 45, No. 5, May 1996. At 
least one problem with this approach is that it does not 
address multiprocessor issues, since it is limited to unipro- 
cessor implementations. 

U.S. Pat. No. 5,911,057, entitled "Superscalar Micropro- 
cessor Having Combined Register and Memory Renaming 
Circuits, Systems, and Methods", issued on Jun. 8, 1999, the 
disclosure of which is incorporated herein by reference, 
describes an architecture for renaming memory and register 
operands in uniform fashion. Memory coherence is based 
upon snooping memory requests. While this approach is 
sufficient for the in-order execution of memory operations in 
a multiprocessor computing system, out-of-order operation 
may generate incorrect results in a multiprocessor system. 

U.S. Pat. No. 5,838,941, entitled "Out-of-order Superscalar 
Microprocessor With a Renaming Device that Maps Instruc- 
tions From Memory to Registers", issued on Nov. 17, 1998, 
the disclosure of which is incorporated herein by reference, 
describes symbolic renaming of memory references. At least 
one problem with this approach is that it does not address 
multiprocessor issues, since it is limited to uniprocessor 
implementations. 

U.S. Pat. No. 5,872,990 (hereinafter the "'990 Patent"), 
entitled "Reordering of Memory Reference Operations and 
Conflict Resolution via Rollback in a Multiprocessing 
Environment", issued on Feb. 16, 1999, assigned to the 
assignee herein, the disclosure of which is incorporated 
herein by reference, uses a checkpointing and rollback 
scheme to implement strong memory consistency in multi- 
processing systems with shared caches. While shared cache 
architectures offer a simpler execution model for multiprocessor
systems, their scalability is limited by the number of cache
ports and a number of physical factors (wire length, wire
density, chip size, and so forth). Further, the '990 Patent does
not address the issues of a system with private caches.

U.S. Pat. No. 5,832,205 (hereinafter the "'205 Patent"),
entitled "Memory Controller for a Microprocessor for Detecting a
Failure of Speculation on the Physical Nature of a Component
Being Addressed", issued on Nov. 3, 1998, the disclosure of
which is incorporated herein by reference, describes a gated
store buffer architecture for use in a uniprocessor system. This
gated store buffer is used to allow rollback of the architecture
state in the presence of error conditions. However, the claimed
architecture does not support memory renaming. Further, as
stated above, its operation is limited to uniprocessor computer
systems.

FIG. 4 is a block diagram illustrating a gated store buffer 410
according to the prior art, namely, U.S. Pat. No. 5,832,205. The
store buffer 410 consists of a queue 412 with three pointers,
i.e., a head pointer 414, a gate pointer 416, and a tail pointer
418. The head pointer 414 indicates the starting point of
entries in the store buffer 410. Memory data stored in the store
buffer 410 between the head pointer 414 and the gate pointer 416
are committed, and form a part of the logical memory image. The
gate pointer 416 marks the beginning of uncommitted memory data.
The end of such data is marked by the tail pointer 418.
Uncommitted data can either be committed in the case of
successful operation, or discarded when a rollback occurs.

The architected processor state is modified in conjunction with
the commit or rollback of the store buffer architecture.
Operation of the store buffer architecture is limited to a
particular code generation strategy based on static scheduling
at runtime (including a binary translation step) using a
described underlying Very Large Instruction Word (VLIW)
architecture ("morph host") with support for processor state
commitment or rollback. After a rollback, operation is restarted
using in-order execution.

While gated store buffers offer desirable properties to achieve
high performance implementations, their use has not been
possible in a multiprocessor (MP) environment since typical
gated store buffer implementations can result in incorrect
operation and/or deadlock situations in an MP environment.

The problem surfaces during actions which are to be taken on a
cross-interrogate from a requesting processor. When the location
requested by a cross-interrogate is found in a gated store
buffer, the following actions are possible:

1. Supply the data found in the gated store buffer. This can
result in an incorrect value supplied to the other processor if
the data in the gated store buffer is later discarded, e.g., due
to incorrect speculation. This violates the requirement of
transparent execution of incorrectly speculated operations.

2. Ignore the data in the gated store buffer. This violates
memory consistency requirements and may result in incorrect
operations.

3. Wait for the data in the gated store buffer to be resolved
before sending a response to a cross-interrogate request, i.e.,
either committed to the memory state or revoked from the store
buffer. This can result in a deadlock situation.

Consider the following examples, which illustrate the danger of
incorrect program execution. The first example is provided to
illustrate incorrect operation when a value returned from the
store buffer is later revoked.

For the first example, consider the following code fragment
corresponding to an in-order sequence in a program, the fragment
assuring that the value in location (r8) never contains the
value 0:

    Processor 1                  Processor 2

    cmpwi cr0 = r4, 0            lwz r2, (r8)
    btrue cr0.eq, label
    stw   r4, (r8)
label:

Out-of-order execution may generate the following out-of-order
sequence:

    Processor 1                        Processor 2

    1 cmpwi cr0 = r4, 0
    2 gated_stw r4, (r8)
    3 btrue cr0.eq, label              lwz r2, (r8)
    4 commit_gated_stores 1
    5 br done
    6 label: revoke_gated_store 1
    7 done:

Consider a case where r4 contains the value 0, and a
cross-interrogate request from processor 2 is answered with the
value stored by the gated store instruction (instruction 2).
Then, instruction 3 on processor 2 receives the value deposited
by the gated store buffer (instruction 2 on processor 1). Even
if that store instruction is later revoked (instruction 6), the
value has already been incorrectly loaded by processor 2 and is
used for further processing, leading to incorrect results.

The second example illustrates the incorrect operation of
programs in the presence of gated store buffers when data in the
gated store buffer is ignored and the original value is supplied
on a cross-interrogate. The second example also illustrates how
a deadlock situation can occur if the responses to
cross-interrogate requests are delayed until data in the gated
store buffers have been resolved.

For the second example, consider the following code fragments
corresponding to in-order programs executing on two processors.
In the second example, it is presumed that register r8 holds the
same memory address on both processors. Also, register r9 holds
the same value on both processors. Furthermore, registers r8 and
r9 refer to distinct, non-overlapping memory locations, and data
memory is initialized to contain 0 in both locations:

    Processor 1                  Processor 2

The programs are based on a well-known test case for coherent
memory implementation. To execute correctly, register r4 must
contain the value "1" on at least one processor after execution.

Now, consider a program which has been reordered to achieve a
better instruction schedule. The program uses the capabilities
provided by a gated store buffer to ensure that store results
from instruction 2 are retired to memory before the store
results from instruction 4.
In an implementation ignoring data in gated store buffers 
and supplying a previous data value, neither load operation at 
instruction 4' will be supplied with the data value 
deposited by the gated store operations at instruction 3'. Rather, 
the pre-initialized data value of 0 will be supplied to both 
load operations at instruction 4'. This corresponds to an 
incorrect execution of the multiprocessor program. 

The third implementation choice, delaying answers to 
cross-interrogate requests that reference data in the store buffer 
until that gated store buffer data is committed, leads to a 
deadlock when both processors wait for the results from 
cross-interrogate requests to resolve the load operations at 
instruction 4': 
Thus, having demonstrated the inadequacy of gated store 
buffers and other prior art approaches to reordering and 
renaming memory references in a multiprocessor 
environment, it can be seen that there is a need for a better 
architecture and/or methodology for reordering and renaming 
memory references in a multiprocessor computer system. 
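
To make the prior-art structure concrete, the head/gate/tail queue of FIG. 4 and the first hazard described above (supplying uncommitted data that is later revoked) can be sketched as follows. This is a minimal model, not the '205 Patent's implementation: the class and method names are hypothetical, the head pointer is simply index 0 of a Python list, the tail pointer is the list length, and `drain` (writing committed entries to a dictionary standing in for memory) is an assumption added for illustration.

```python
class GatedStoreBuffer:
    """Toy model of queue 412 with head, gate, and tail pointers."""

    def __init__(self):
        self.queue = []  # (address, value); index 0 plays the head pointer
        self.gate = 0    # queue[:gate] is committed, queue[gate:] is not

    def store(self, addr, value):
        # New stores enter at the tail as uncommitted data.
        self.queue.append((addr, value))

    def commit(self):
        # Successful operation: move the gate pointer up to the tail.
        self.gate = len(self.queue)

    def rollback(self):
        # Rollback: discard everything between the gate and the tail.
        del self.queue[self.gate:]

    def drain(self, memory):
        # Assumption: committed entries are written to memory in order.
        for addr, value in self.queue[:self.gate]:
            memory[addr] = value
        del self.queue[:self.gate]
        self.gate = 0

    def lookup(self, addr, include_uncommitted):
        # Youngest matching entry; option 1 above looks past the gate,
        # option 2 does not.
        limit = len(self.queue) if include_uncommitted else self.gate
        for a, v in reversed(self.queue[:limit]):
            if a == addr:
                return v
        return None


# First example: location R8 must never be observed holding 0.
R8 = 0x100
memory = {R8: 1}
buf = GatedStoreBuffer()
buf.store(R8, 0)                        # gated store issued speculatively
leaked = buf.lookup(R8, True)           # option 1 answers the other CPU
buf.rollback()                          # the store is revoked afterwards
```

After the rollback, `memory[R8]` is still 1, yet `leaked` holds 0: the other processor has consumed a value that was never part of the architected memory state.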

SUMMARY OF THE INVENTION 

The problems stated above, as well as other related 
problems of the prior art, are solved by the present invention, 
a method and apparatus for reordering memory operations in 
a processor. 

In a first aspect of the invention, there is provided a 
method for reordering and renaming memory references in 
a multiprocessor computer system having at least a first and 
a second processor. The first processor has a first private 
cache and a first buffer, and the second processor has a 
second private cache and a second buffer. The method 
includes the steps of, for each of a plurality of gated store 
requests received by the first processor to store a datum, 
exclusively acquiring a cache line that contains the datum by 
the first private cache, and storing the datum in the first 
buffer. Upon the first buffer receiving a load request from the 
first processor to load a particular datum, the particular 
datum is provided to the first processor from among the data 
stored in the first buffer based on an in-order sequence of 
load and store operations. Upon the first cache receiving a 
load request from the second cache for a given datum, an 
error condition is indicated and a current state of at least one 
of the processors is reset to an earlier state when the load 
request for the given datum corresponds to the data stored in 
the first buffer. 
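
The three steps of the first aspect can be illustrated with a small Python model. Everything here is a sketch under assumptions rather than the claimed apparatus: `FirstBuffer`, `LINE_SIZE`, and the `seq` program-order tags are hypothetical, matching is done per address, the reset simply discards the whole buffer, and the value returned to the remote requester after a reset is an assumption because this passage does not specify it.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


class FirstBuffer:
    def __init__(self):
        self.entries = []             # (seq, address, value) gated stores
        self.exclusive_lines = set()  # lines held exclusively by the cache
        self.error = False

    def gated_store(self, seq, addr, value):
        # Exclusively acquire the containing line, then buffer the datum.
        self.exclusive_lines.add(addr // LINE_SIZE)
        self.entries.append((seq, addr, value))

    def load(self, seq, addr, memory):
        # Resolve against the youngest store that precedes this load in
        # program order, whatever order the stores actually executed in.
        older = [(s, v) for s, a, v in self.entries if a == addr and s < seq]
        if older:
            return max(older, key=lambda e: e[0])[1]
        return memory[addr]

    def remote_load(self, addr, memory):
        # Load request arriving from the second cache.
        if any(a == addr for _, a, _ in self.entries):
            self.error = True             # indicate the error condition
            self.entries.clear()          # assumed reset: drop all entries
            self.exclusive_lines.clear()
        return memory[addr]               # assumption: answer from memory


# Stores to 12(r8) from the DIV example, executed out of order.
A = 0x200
memory = {A: 9}
buf = FirstBuffer()
buf.gated_store(5, A, 0)          # ST r5, 12(r8) executes first
r6 = buf.load(6, A, memory)       # L r6 sees the store tagged 5
buf.gated_store(2, A, 4)          # ST r3, 12(r8) executes later
r4 = buf.load(3, A, memory)       # L r4 sees the store tagged 2
```

Here `r6` is 0 and `r4` is 4, which is the in-order result even though the stores executed in the opposite order. A subsequent `buf.remote_load(A, memory)` would find matching buffered data, set `error`, and empty the buffer.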

In a second aspect of the invention, the method further 
includes the step of committing at least some of the data in 
the first buffer to an architected memory state of the com- 
puter system, prior to the indicating step, to remove the at 
least some of the data from the first buffer. The indicating 
step is performed only when the given datum in the first 
buffer is not committed. 

In a third aspect of the invention, the committing step 
commits a specified datum to the architected memory state 
of the computer system when a gated store request corre- 
sponding to the specified datum is in-order with respect to all 
instructions that precede the gated store request. 

In a fourth aspect of the invention, the resetting step 
includes the step of discarding at least some of the data in the 
first buffer. 

In a fifth aspect of the invention, the resetting step 
includes the step of discarding the given datum from the first 
buffer and all data stored thereafter. 

In a sixth aspect of the invention, the method further 
includes the step of releasing the cache line when operations 
referring to the cache line have completed execution 
in-order. 

In a seventh aspect of the invention, the method further 
includes the step of releasing the cache line, when the datum 
contained within the cache line is committed to an archi- 
tected memory state of the computer system in-order or 
when the datum is discarded from the first buffer. 

In an eighth aspect of the invention, the earlier state 
corresponds to an operation immediately preceding the 
gated store request that stored the given datum in the first 
buffer. 

In a ninth aspect of the invention, the method further 
includes the step of generating a snapshot of the earlier state. 

In a tenth aspect of the invention, the generating step 
includes one of the steps of copying contents of registers 
corresponding to the earlier state, and maintaining a record 
of incremental state changes from at least one state preceding 
the earlier state up to the earlier state. 

In an eleventh aspect of the invention, the method further 
includes the step of storing a snapshot of the earlier state in 
the first buffer. 

In a twelfth aspect of the invention, the method further 
includes the step of storing a snapshot of the earlier state in 
one of the first processor, the second processor, and a storage 
device external thereto. A timestamp corresponding to the 
snapshot of the earlier state is stored in the first buffer in 
association with the given datum. 

In a thirteenth aspect of the invention, the resetting step 
includes the step of searching for the timestamp in the first 
buffer to identify the snapshot from among a plurality of 
snapshots stored in one of the first processor, the second 
processor, and the storage device external thereto. 
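By way of a non-limiting illustration, the timestamp search of the
thirteenth aspect might be sketched as follows. The function name
snapshot_for, the (address, value, timestamp) tuple layout and the
dictionary of snapshots are assumptions made for this sketch only and
are not taken from the disclosure.

```python
def snapshot_for(addr, buffer_entries, snapshots):
    """Find the snapshot to restore for a conflicting address.

    buffer_entries: list of (addr, value, timestamp) tuples held in the buffer
                    (hypothetical layout).
    snapshots:      dict mapping timestamp -> saved processor state.
    Returns the snapshot tied to the earliest buffered entry for addr,
    or None when the buffer holds no entry for that address.
    """
    stamps = [ts for a, _value, ts in buffer_entries if a == addr]
    if not stamps:
        return None
    # The earliest matching entry identifies the state to roll back to.
    return snapshots[min(stamps)]
```

In this sketch the snapshots themselves may live anywhere (in either
processor or in external storage); only the timestamp needs to be kept
alongside the datum in the buffer.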

In a fourteenth aspect of the invention, the method further 
includes the step of processing the store and load requests 
in-order and suspending the steps involving the first buffer, 
upon performing a predetermined number of resetting steps. 

These and other aspects, features and advantages of the 
present invention will become apparent from the following 
detailed description of preferred embodiments, which is to 
be read in connection with the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWING 

FIG. 1 is a block diagram illustrating a simplified
multiprocessor computing system with private caches, to which
the present invention may be applied;

FIG. 2 is a flow diagram illustrating the actions taken
upon receiving a memory access request in a multiprocessor
environment having a memory hierarchy with private
caches, according to the prior art;

FIG. 3 is a flow diagram illustrating the actions taken by
a processor 106 upon receiving a cache cross-interrogate
request from a processor 102, according to the prior art;

FIG. 4 is a block diagram illustrating a gated store buffer,
according to the prior art;

FIG. 5 is a flow diagram illustrating the actions taken by
a processor 106 upon receiving a cache cross-interrogate
request from a processor 102, according to an illustrative
embodiment of the present invention;

FIG. 6 is a flow diagram illustrating the actions taken by
a cache controller of a private cache 104 upon receiving a
memory access request from a processor 102 in a memory
hierarchy with private caches corresponding to a
multiprocessor environment, according to an illustrative
embodiment of the present invention;

FIG. 7 is a flow diagram illustrating the actions taken by
a processor 106 upon receiving a cache cross-interrogate
request from a processor 102, according to another
illustrative embodiment of the present invention; and

FIG. 8 is a diagram of a multiprocessor computer system
that includes an apparatus for reordering and renaming
memory references according to an illustrative embodiment
of the present invention.

DETAILED DESCRIPTION OF PREFERRED
EMBODIMENTS

The present invention is directed to methods and apparatus
for reordering and renaming memory references in a
multiprocessor computer system. In such a system,
sequences of instructions are stored in a memory for
execution by a plurality of processors (i.e., two or more). The
invention is applicable to operations reordered when the
program is generated (static reordering) as well as to
operations reordered at execution time (dynamic reordering).
Further, the invention is suitable for software and/or
hardware based implementations.

To facilitate a clear understanding of the present
invention, definitions of terms employed herein will now be
given. A load instruction refers to any instruction performing
a memory read-access and (optionally) computations based
on the loaded value. Thus, a load instruction may include,
for example, logic, arithmetic and other instructions which
employ data from memory locations as operands.
Out-of-order execution is a technique by which the operations
in a sequential stream of instructions are reordered so that
operations appearing later are executed earlier, if the resources
required by the later appearing operations are free. Thus, an
out-of-order load instruction may be created, either statically
or dynamically, by moving a load instruction from its
original position in a sequence of instructions to an earlier
position in the sequence of instructions. Such out-of-order
load instruction identifies a location in memory from which
to read a datum, and a first destination register in which to
place the datum. An ambiguous memory reference refers to
the case when a memory load operation appears after
another memory load operation in a sequential instruction
stream, and it is not possible to determine prior to execution
of the program whether the memory locations to be accessed
by the two memory load operations are different. A
speculative operation (e.g., a memory reference) is an
out-of-order operation that may not necessarily be executed
because, for example, the execution path in which it is located
may not be followed but rather, another execution path may be
followed instead. A cross-interrogate is a protocol action
whereby a first processor cache checks whether a specific
cache line is resident in at least one other processor's cache
in a multiprocessor system. In this context, a cross-interrogate
request can refer to such test in either a bus-snooping
or a directory-based implementation of cache consistency. A
snapshot is a record of the processor state at a particular
execution point.

The following description and corresponding examples
will be given based on two instructions (unless otherwise
noted), a first instruction which is executed out-of-order
before a second, logically preceding instruction, which will
be termed an in-order instruction. Thus, unless otherwise
noted, the designation 'in-order' refers only to the sequential
relationship between the logically preceding in-order
instruction and the first 'out-of-order' instruction. It is to be
noted that the above two instructions (i.e., both the in-order
and the out-of-order load instructions) may be in-order or
out-of-order with respect to a third instruction (and so forth).

It is to be understood that the present invention may be
implemented in various forms of hardware, software,
firmware, special purpose processors, or a combination
thereof. It is to be further understood that, because some of
the constituent system components and method steps
depicted in the accompanying figures are preferably
implemented in hardware, the actual connections between the
system components (or the process steps) may differ
depending upon the manner in which the present invention
is programmed.

A general description of the present invention will now be
provided to introduce the reader to the concepts of the
invention. Subsequently, more detailed descriptions of
various aspects of the invention will be provided.

In general, the present invention provides a method and
apparatus for implementing memory operation reordering in a
multiprocessor system. A memory system implementing the
present invention can support speculative execution of
memory store operations, arbitrary reordering of memory
store operations with respect to each other and the ability to
support correct resolution of memory load references in the
presence of multiple outstanding store operations to a single
memory address ("memory renaming"). The present
invention is based on buffering memory store operations until
the correctness of operations is assured and then committing
data to memory, in a memory reorder and rename
buffer (MRRB) described hereinbelow.

Error conditions (such as speculating across memory
barriers, or resource allocation conflicts between multiple
processors) are resolved using a rollback/commit scheme. In
the absence of errors or resource conflicts, the contents of
the memory buffer are committed to memory in-order. When
an error condition is detected, the memory buffer is partially
or completely discarded and the processor state is reset to an
earlier state ("rollback"). When a resource conflict has been
detected, one or both processors perform the rollback
operation.

The present invention includes embodiments compatible
with existing protocols and other embodiments based on an
extended set of protocols.

The memory reorder and rename buffer (MRRB) is a
device which allows store operations to be executed
out-of-order with respect to each other, and with respect to
control-flow altering instructions (e.g., branch instructions).

To perform these capabilities, the MRRB performs the
following functions:

1. Receives gated store requests, which may be out-of-order
with respect to each other. A gated store refers to either
an out-of-order store which is not committed until the
in-order point of the store operation, or a conditional store
which may later be annulled. A gated store request is a
request from the CPU to perform a gated store operation to
the MRRB.

2. Optionally, satisfies load requests by supplying data
values equivalent to an in-order execution sequence of load
and store operations, or, optionally, indicates an error
condition if ambiguous loads are moved out-of-order with
respect to store requests.

3. Supports the capability to commit the values to memory
in-order in response to a first control signal.

4. Supports the capability to discard one or more values
from the MRRB in response to a second signal.

5. Takes actions to ensure memory consistency in
response to cross-interrogate requests from other processors
(those not associated with the MRRB to which the request
is made) when the other processors request memory
locations currently controlled by the MRRB.

According to the present invention, MRRBs are adapted
for use with multiprocessor systems by enhanced
synchronization actions which are taken when a cache
cross-interrogate finds a requested location in the MRRB. Since
waiting for a value to be released (committed) by the MRRB
can cause deadlock situations to occur, waiting for a commit
on a speculative value is not possible.

The present invention therefore is based on synchronizing
MRRB contents by rolling back execution to a previous,
well-defined state. This eliminates the possibility of a
deadlock (since there is no waiting for processor resources).

It is to be appreciated that an MRRB according to the
present invention may be implemented as any type of
storage device, such as, for example, a buffer, cache, and so
forth. It is to be further appreciated that the MRRB may be
a stand alone storage device or the MRRB may be included
in an existing storage device such as a cache or processor
register(s). In the latter case, the MRRB, while included in
the existing device, nonetheless operates in a stand alone
mode with respect to operations conducted in the existing
device. Thus, for example, if an MRRB is implemented
in an existing private cache (or register(s)) of a particular
processor, the portion of the cache or register(s) in which the
MRRB operates is independent of the operations of the
cache as if the MRRB were a separate entity from the
private cache. Alternatively, some cross-functionality may
be implemented between the MRRB and the storage device.
Given the teachings of the present invention provided
herein, one of ordinary skill in the related art will be able to
contemplate these and similar implementations of the
elements of the invention.

The present invention is preferably used with a memory
consistency scheme which allows a private cache to obtain
exclusive ownership of memory locations for write-access.
Multiprocessor coherence protocols are described by Curt
Schimmel, in "Multiprocessor Systems with Caches", Unix
Systems for Modern Architectures, Addison-Wesley, pub.,
Reading, Mass. 1994.

According to the present invention, when a gated store
request is received by a processor, the corresponding
memory location is acquired in exclusive access (write) state
by the private cache of the processor. The data value for the
memory location is then maintained in the MRRB, while
ownership of the corresponding memory line is retained by
the private cache until the data value is removed from the
MRRB, either by retiring the data value in-order to the
memory hierarchy, or by discarding the data value from the
MRRB.

In addition, for any data value maintained in the MRRB,
the processor retains at least one processor state snapshot
preceding the instruction which deposited the value in
the MRRB. This is required to allow "backing out" of any
gated store operation by removing the corresponding value
(and all values entered in the MRRB thereafter) from the
MRRB, and resetting the processor state.

According to the present invention, processors with
private caches implementing MRRB capabilities acquire
exclusive control of memory locations for which gated store
requests have been issued.

A description of various modes of operation of an MRRB
according to an illustrative embodiment of the present
invention will now be given.

In a first mode of operation, the MRRB receives a request
[...] gated store request in temporary memory. Memory
locations requested by the MRRB may not be surrendered by
the cache hierarchy in response to protocol requests. Such
requests will be handled by enhanced control logic described
below.

In a second mode of operation, the MRRB receives a load
request. If a matching location is found in the MRRB, then
the MRRB can optionally bypass information as appropriate
to guarantee correct execution of the program. This can
include the action of selecting from more than one cached
store request, based upon the in-order sequence of the load
and store operations. Alternatively, an MRRB may optionally
compare the data address of a load operation with the
addresses of one or more gated store values and indicate
address aliasing conditions which may lead to incorrect
operation of a program with reordered memory operations.

In a third mode of operation, the MRRB receives a request
to commit part or all of its contents to memory. This is
usually triggered by in-order retirement logic of the
processor. In response, the MRRB retires the stored data values
to the memory hierarchy in original program order. In addition,
the MRRB gives up control of the corresponding memory
locations, which can now be shared with, or surrendered to,
other processors in a multiprocessor system in response to
protocol requests.

In a fourth mode of operation, the MRRB receives a
request to discard one or more values from the MRRB. This
is usually triggered by a rollback of speculatively executed
instructions in the processor, e.g., after a mispredicted
branch. The MRRB may surrender control of the
corresponding memory location.

In a fifth mode of operation, a processor node
incorporating a private cache and an MRRB receives a
cross-interrogate request for a memory location. If the MRRB
contains the address which is the subject of the request, at
least one of the MRRBs in either the data requester (i.e., the
processor which is trying to access data), or the data owner
(i.e., the processor which has exclusive control of the
memory location and currently maintains a value in the
MRRB) is forced to discard the contents of the MRRB and
roll back its processor state to an earlier snapshot which
corresponds to an empty MRRB.

A description of a method/apparatus for reordering and
renaming memory references in a multiprocessor computer
system according to an illustrative embodiment of the
present invention will now be given. The method/apparatus
is compatible with existing system bus protocols, i.e., a
processor implementing the method/apparatus can be
integrated in existing computer designs.

The processor contains an MRRB which performs reor- 
dering and/or renaming of memory references. When a 
memory location is first processed for inclusion in the 
MRRB, the processor obtains the "containing" cache line 
and maintains ownership of the cache line until all MRRB 
entries which refer to memory locations contained in this 
line can be committed to memory in-order. For every cache 
line requested, a snapshot of the architected processor state 
is made such that the snapshot captures the processor state 
prior to processing the current request. 
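For purposes of illustration only, the buffering, load bypass, in-order
commit and discard functions attributed to the MRRB above might be
modeled in software as follows. The class and method names, the use of a
program-order sequence number (seq), and the choice to discard by
program order are all assumptions of this sketch; the disclosure does
not prescribe any particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class GatedStore:
    seq: int      # position of the store in original program order (assumed tag)
    addr: int     # memory address written
    value: int    # datum to be stored


@dataclass
class MRRB:
    """Toy memory reorder and rename buffer (illustrative only)."""
    entries: list = field(default_factory=list)

    def gated_store(self, seq, addr, value):
        # Stores may arrive out of order; they are only buffered here.
        self.entries.append(GatedStore(seq, addr, value))

    def load(self, seq, addr, memory):
        # Bypass from the youngest buffered store that precedes the load
        # in program order; otherwise fall back to architected memory.
        older = [e for e in self.entries if e.addr == addr and e.seq < seq]
        if older:
            return max(older, key=lambda e: e.seq).value
        return memory.get(addr, 0)

    def commit(self, up_to_seq, memory):
        # Retire buffered stores to memory in original program order.
        ready = sorted((e for e in self.entries if e.seq <= up_to_seq),
                       key=lambda e: e.seq)
        for e in ready:
            memory[e.addr] = e.value
        self.entries = [e for e in self.entries if e.seq > up_to_seq]

    def discard(self, from_seq):
        # Drop the store tagged from_seq and every store after it.
        self.entries = [e for e in self.entries if e.seq < from_seq]
```

Because several stores to one address can be outstanding at once, the
load method selects among them by sequence number, which is one way to
realize the "memory renaming" behavior described above.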

It is to be appreciated that it is not necessary to make a 
snapshot for every request. The only requirement is that 
there be a snapshot preceding the current request which 
captures the processor state such that a successful rollback 
can be performed. A single snapshot corresponding to the
state of the processor with an empty MRRB is sufficient.
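The single-snapshot policy just described might be sketched as follows.
The SpeculativeCore class, its register names and the use of a Python
list to stand in for the MRRB are hypothetical and serve only to show
that one snapshot taken at the empty-MRRB point suffices for a full
rollback.

```python
import copy


class SpeculativeCore:
    """Toy processor model: registers plus a list standing in for the MRRB."""

    def __init__(self):
        self.registers = {"pc": 0, "r1": 0}
        self.mrrb = []          # buffered (addr, value) gated stores
        self.snapshot = None    # state captured while the MRRB was empty

    def gated_store(self, addr, value):
        if not self.mrrb:
            # Snapshot the state preceding the first buffered store.
            self.snapshot = copy.deepcopy(self.registers)
        self.mrrb.append((addr, value))
        self.registers["pc"] += 1

    def rollback(self):
        # Discard all buffered stores and restore the saved register state.
        self.mrrb.clear()
        if self.snapshot is not None:
            self.registers = copy.deepcopy(self.snapshot)
```

Keeping more snapshots would permit finer-grained rollback at the cost
of additional storage, which is the tradeoff noted in the text.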

When processor 106 requests a data item according to 
FIG. 3, processor 102 processes a cross-interrogate request 
according to FIG. 5. FIG. 5 represents an augmented pro- 
cessing capability of cross-interrogates in accordance with 
the present invention. For illustrative purposes, a cross 
interrogate can either be in the form of an explicit request in 
a directory-based coherence protocol, or a snoop action in a 
bus protocol with snooping. However, other forms of cross- 
interrogates may be used in accordance with the present 
invention. 

Snooping refers to a bus-based coherence protocol. In this 
protocol, each node monitors the system bus for requests 
from other processors. When a node identifies a request from 
another node for a memory location it currently maintains in 
its own cache, coherence actions are performed. 

FIG. 5 is a flow diagram illustrating the actions taken by 
a processor 106 upon receiving a cache cross-interrogate 
request from a processor 102 according to an illustrative
embodiment of the present invention. It is to be appreciated 
that the method of FIG. 5 is performed (by processor 106) 
in response to step 224 of FIG. 2. Accordingly, step 224 is 
considered a protocol action and the output of the method of 
FIG. 5 is considered a protocol answer. Processor 106 
described with respect to FIG. 5 is presumed to be a 
processor that is compatible with existing bus protocols. 
That is, the protocol action and protocol answer between the 
two processors are the same as in the prior art. This allows 
the re-use of bus infrastructure from previous systems, and 
the mixing of standard processors and processors with 
enhanced capabilities according to the present invention. 

When processor 106 receives a cross interrogate request 
from processor 102 (step 510), processor 106 searches a 
private cache directory to determine whether the request 
refers to a location in the private cache 108 (including the 
MRRB) of the processor 106 (step 512). If the request does
not refer to a location in the private cache hierarchy of 
processor 106, then the method is terminated. 

However, if the request refers to a location in the private 
cache hierarchy of processor 106, then the MRRB is 
searched to determine whether the request refers to a loca- 
tion in the MRRB (step 516). Note that the search step (step 
516) can be omitted, if an indication flag is added to the 
cache directory. If the request does not refer to a location in 
the MRRB, then private cache 108 gives up the requested item
(i.e., surrenders control of the requested item according to
the method of the bus protocol being used) in response to
the cross-interrogate request and performs whatever protocol
actions are necessary to transfer control to the requesting
cache (i.e., private cache 104) (step 518). However, if the
request refers to a location in the MRRB, then, to ensure 
memory coherence without the possibility of a deadlock, the 
MRRB is invalidated and processor 106 performs a rollback 
action by resetting the processor state to a prior snapshot 
(step 520). Once the state has been rolled back, the entries 
remain in private cache 108, but have not been modified. 
The entries can then be given up as necessary to fulfill 
cross-interrogate requests. 
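As a non-authoritative sketch, the decision sequence of FIG. 5 described
above might be written as follows. The function name, the use of Python
sets for the cache directory and the MRRB contents, and the returned
strings are assumptions of this sketch; it also assumes that the line is
surrendered immediately after the rollback, which the text permits but
does not require.

```python
def handle_cross_interrogate(addr, cache_lines, mrrb_addrs, rollback):
    """Model the owner's response to a cross-interrogate.

    cache_lines: set of addresses resident in the private cache
    mrrb_addrs:  set of addresses with buffered gated stores
    rollback:    callable that restores the processor to a prior snapshot
    """
    if addr not in cache_lines:      # step 512: not resident, nothing to do
        return "miss"
    outcome = "surrendered"
    if addr in mrrb_addrs:           # step 516: location is held in the MRRB
        rollback()                   # step 520: reset state to a snapshot
        mrrb_addrs.clear()           # step 520: invalidate the MRRB
        outcome = "rolled_back"
    cache_lines.discard(addr)        # give up the (unmodified) line
    return outcome
```

Since the owner never waits for its own speculative stores to commit,
the requester always receives an ordinary protocol answer, which is why
this embodiment remains compatible with existing bus protocols.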

Note that the entire MRRB is not necessarily discarded. A
partial rollback is possible by discarding only the found
entries and all later entries (i.e., entries recorded after the 
time corresponding to the earliest found entry), and by 
rolling back to the state immediately preceding the current 
entry. This involves tradeoffs as to the complexity of the 
rollback logic and the number of snapshots maintained. 
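The partial rollback described above might be sketched as follows,
under the assumption (made only for this sketch) that the buffer is an
ordered list of (address, value) pairs in the order in which they were
recorded. The function name partial_discard is hypothetical.

```python
def partial_discard(entries, requested_addr):
    """Discard the earliest matching entry and everything recorded after it.

    entries: list of (addr, value) in the order they entered the buffer.
    Returns (kept_entries, rollback_index). rollback_index is None when no
    entry matches; otherwise it is the position of the earliest match, i.e.
    the point whose preceding state must be restored.
    """
    for index, (addr, _value) in enumerate(entries):
        if addr == requested_addr:
            return entries[:index], index
    return list(entries), None
```

The returned index identifies which snapshot is needed, so a design
using this scheme must retain a snapshot (or incremental record) for
each position to which it may roll back.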

FIG. 6 is a flow diagram illustrating the actions taken by 
the cache controller of private cache 104 upon receiving a 
memory access request from processor 102 in a memory 
hierarchy with private caches corresponding to a multipro- 
cessor environment according to an illustrative embodiment 
of the present invention. In the embodiment of FIG. 6, the 
processors are presumed to have a modified bus protocol. 
The modified bus protocol can be based on an existing bus 
protocol, with an additional protocol response indicating a 
processor's refusal to give up a processor line in response to 
a cross-interrogate request. 

Processor 106 maintains control of a plurality of memory 
locations in its MRRB. Processor 102 accesses a memory 
location according to the method of FIG. 6. 

The cache controller of processor 102 receives a request 
for data from processor 102 (step 610). It is then determined 
whether the request refers to a location in private cache 104 
of processor 102 (step 612). If the request does not refer to 
a location in private cache 104 of processor 102, then the 
method proceeds to step 620. 

However, if the request refers to a location in private 
cache 104 of processor 102, then it is determined whether 
the entry in private cache 104 corresponding to the request 
has the required permissions (i.e., if a write request has been 
issued, whether the cache entry is in exclusive ownership 
mode which allows the write request to proceed, or in shared 
ownership mode, which only allows read requests to be 
processed) (step 614). If the entry does not have the required 
permissions, then the method proceeds to step 620. 
However, if the entry has the required permissions, then the 
request is satisfied from private cache 104 (step 618), and 
the method is terminated. 

At step 620, it is determined, via a cross-interrogate, 
whether the location is resident in private cache 108 of 
processor 106. If the location is not resident in private cache 
108 of processor 106, then the method proceeds to step 632. 
However, if the location is resident in private cache 108 of 
processor 106, then the location is requested with the 
appropriate permissions (via a cross-interrogate) (step 624). 
It is then determined whether the request was successful 
(step 626). If the request was successful, then the method 
proceeds to step 634 (step 628). However, if the request was
not successful, then the MRRB of processor 102 is invalidated
and processor 102 is restored based on a snapshot
corresponding to the architected processor state prior to the
first MRRB request (step 630), and the method is terminated.

At step 632, the memory location is fetched from main
memory, and the method proceeds to step 634. At step 634,
the memory location (which was either fetched at step 632
or received in response to the cross-interrogate at step 624)
is stored in private cache 108, the request is satisfied, and the
method is terminated.
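The requester-side flow of FIG. 6 might be sketched as follows. The
function signature, the REFUSED sentinel, the dictionary-based cache and
the remote callable are all hypothetical stand-ins chosen for this
sketch; in particular, the refusal response models the additional
protocol answer of the modified bus protocol.

```python
REFUSED = object()   # hypothetical marker for the owner's negative response


def access(addr, need_exclusive, local_cache, remote, main_memory, rollback):
    """Model one memory access by the requesting processor.

    local_cache: dict addr -> (value, state), state "shared" or "exclusive"
    remote:      callable(addr, need_exclusive) returning a value, None if
                 the line is not resident remotely, or REFUSED
    rollback:    callable that invalidates the MRRB and restores a snapshot
    """
    entry = local_cache.get(addr)                         # step 612
    if entry is not None:
        value, state = entry
        if state == "exclusive" or not need_exclusive:    # step 614
            return value                                  # step 618
    answer = remote(addr, need_exclusive)                 # steps 620 and 624
    if answer is REFUSED:                                 # step 626 fails
        rollback()                                        # step 630
        return None
    value = main_memory[addr] if answer is None else answer   # step 632
    state = "exclusive" if need_exclusive else "shared"
    local_cache[addr] = (value, state)                    # step 634
    return value
```

In this embodiment it is the requester, rather than the owner, that
rolls back, so a refusal can never leave the requester waiting on a
line while its own MRRB holds speculative values.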

FIG. 7 is a flow diagram illustrating the actions taken by
processor 106 upon receiving a cache cross-interrogate
request from processor 102 according to the illustrative
embodiment of FIG. 6. It is to be appreciated that the method
of FIG. 7 is performed (by processor 106) in response to step
614 of FIG. 6. Accordingly, step 614 is considered a protocol
action and the output of the method of FIG. 7 is considered
a protocol answer. In the embodiment of FIG. 7, the
processors are presumed to have a modified bus protocol as in
FIG. 6. That is, the protocol answer has been augmented to
indicate that processor 106 (of FIG. 7) does not give up
control of the requested memory location to the requesting
processor (processor 102 of FIG. 6).

Upon processor 106 receiving a cross-interrogate request
from processor 102 (step 710), processor 106 searches its
private cache directory to determine whether the request
refers to a location in its private cache 108 (step 712). If the
request does not refer to a location in private cache 108 of
processor 106, then the method is terminated.

However, if the request refers to a location in private
cache 108 of processor 106, then the MRRB of processor
106 is searched to determine whether the request refers to a
location in the MRRB (step 716). If the request refers to a
location in the MRRB, then processor 106 refuses to give up
the request line and indicates this by returning a negative
protocol response to the requesting processor (i.e., processor
102) (step 718). However, if the request does not refer to a
location in the MRRB, then private cache 108 gives up the
requested item in response to the cross-interrogate request
and performs whatever protocol actions are necessary to
transfer control to the requesting cache (i.e., private cache
104) (step 720).

FIG. 8 is a diagram of a multiprocessor computer system
that includes an apparatus for reordering and renaming
memory references according to an illustrative embodiment
of the present invention. The system 800 includes a
processor 102, a processor 106, a main memory 110, a main
memory 112, and input/output (I/O) devices 114. The
components of the system are all operatively coupled to each
other through system bus 116. I/O 114 collectively refers to I/O
adapters (e.g., video card) and the I/O devices (e.g., monitor)
operatively coupled thereto. Main memory 110 and main
memory 112 are shared between the processors.

The apparatus includes a plurality of buffers 810a-b. Each
buffer corresponds to one of the processors. In the
embodiment of FIG. 8, buffer 810a corresponds to processor 102,
and buffer 810b corresponds to processor 106. Moreover,
each buffer is adapted to store a datum when a corresponding
processor receives a gated store request to store the datum,
and to provide a particular datum to the corresponding
processor from among the stored data based on an in-order
sequence of load and store operations when a load request is
received from the corresponding processor to load the
particular datum.

The apparatus further includes a plurality of private
caches 812a-b. Each cache corresponds to one of the
processors. In the embodiment of FIG. 8, cache 104
corresponds to processor 102, and cache 108 corresponds to
processor 106. Moreover, each cache is adapted to
exclusively acquire a cache line containing the datum when the
corresponding processor receives the gated store request to
store the datum.

The processor 102, private cache 812a, and buffer 810a
comprise a node 803 of system 800. The processor 106,
private cache 812b, and buffer 810b comprise a node 807 of
system 800.

The apparatus also includes an indicator 814 adapted to
indicate an error condition when a load request for a given
datum corresponds to the data in one of the plurality of
buffers 810a-b. The load request for the given datum is
received by the cache of the processor corresponding to the
buffer containing the given datum. Thus, if buffer 810a,
which corresponds to processor 102 in the above example,
contains the given datum, then the load request for the given
datum is received by cache 104 of processor 102. The load
request for the given datum is provided from the
non-corresponding processor. Thus, to continue the example, the
load for the given datum would be received from processor
106.

The apparatus additionally includes a reset device 816
adapted to reset a current state of at least one of the
processors to an earlier state when the load request for the
given datum corresponds to the data in one of the plurality
of buffers 810a-b (buffer 810a in the above example).

Optionally, the apparatus may further include
commitment logic 818a-b adapted to commit at least some of the
data in the buffers to an architected memory state of the
computer system to remove the at least some of the data
from the buffers. If the commitment logic 818a-b is
included in the apparatus, then the indicator 814a-b
indicates the error condition only when the given datum is not
committed. Given the teachings of the present invention
provided herein, one of ordinary skill in the related art will
be able to contemplate these and similar implementations of
the elements of the invention.

A description of optimizations and alternate
embodiments of the present invention will now be given. For
example, the reordering and renaming of operations can
occur dynamically, or statically, at runtime or compile time,
by hardware or software, and can optionally include a binary
translation step.

In response to a rollback event, operation can either be
restarted based on hardware execution, or using specially
[...] to minimize conflicts.

In an optimized embodiment of the present invention, a
mechanism selects whether to perform a rollback of the
processor owning the cache line (as in the first illustrative
embodiment), or of the processor requesting the cache line
(as in the second illustrative embodiment), or of both
processors. The selection criterion can be randomly
generated, or based on a protocol to balance forward
progress between multiple processors, e.g., round-robin.

In another optimization, the memory locations held in the
MRRB can be committed in a "lazy" manner, i.e., only when
a request for a memory location is received.

In yet another optimized embodiment, a processor
receiving a cross-interrogate may be allowed to execute for a
maximum predetermined number of cycles if a match is
found in the MRRB. This may allow the processor to
commit the value to memory, thereby eliminating the need
to perform an expensive rollback operation.

In still another optimized embodiment, a response to a
cross-interrogate request can be deferred without time limit
until the requested memory location is ready to be
committed to memory, if it can be established that doing so does
not create a deadlock (e.g., if the other processor does not
maintain any values in its MRRB or does not implement an
MRRB).
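As one possible illustration of the selection mechanism mentioned among
the optimizations above (rolling back either the owner or the requester
according to a round-robin criterion), consider the following sketch.
The RollbackArbiter class and its method names are hypothetical, and
round-robin is only one of the selection criteria the text allows.

```python
class RollbackArbiter:
    """Pick which of two conflicting processors rolls back, in rotation."""

    def __init__(self, processor_ids):
        self._order = list(processor_ids)
        self._next = 0   # index of the next candidate in the rotation

    def choose(self, owner, requester):
        # Advance through the rotation until one of the two conflicting
        # processors is reached; that processor performs the rollback.
        for _ in range(len(self._order)):
            candidate = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if candidate in (owner, requester):
                return candidate
        raise ValueError("neither processor is known to the arbiter")
```

Alternating the victim in this way is intended to keep one processor
from being rolled back repeatedly while the other always makes progress.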

In a further optimized embodiment, one or more proces- 
sors enter a mode where memory requests are processed 
in-order without use of the MRRB to reduce the occurrence 
of rollback events. This mode transition may be caused by 
passing a threshold of rollback events, whereupon serial 
processing of memory operations is initiated. Depending on 
the implementation, memory reordering and renaming is 
restarted automatically by the hardware, or by the intervention
of software (e.g., the operating system after a scheduling
step to avoid further MRRB conflicts and rollbacks).
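The threshold-triggered transition to in-order processing might be
sketched as follows. The ReorderGovernor class, its attribute names and
the explicit resume_reordering call are assumptions of this sketch; the
text leaves open whether hardware or software restarts reordering.

```python
class ReorderGovernor:
    """Track rollback events and fall back to in-order memory processing."""

    def __init__(self, threshold):
        self.threshold = threshold     # rollbacks tolerated before fallback
        self.rollbacks = 0
        self.in_order_mode = False

    def record_rollback(self):
        self.rollbacks += 1
        if self.rollbacks >= self.threshold:
            # Bypass the MRRB: process loads and stores serially from now on.
            self.in_order_mode = True

    def resume_reordering(self):
        # Invoked by hardware, or by software such as the operating system
        # after rescheduling processes to reduce conflicts.
        self.rollbacks = 0
        self.in_order_mode = False
```

The same counter could also be exposed to the operating system as a
scheduling hint, as in the counter-based embodiment that follows.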

In still a further optimized embodiment, counters deter- 
mine the number of MRRB rollback events to guide the 
operating system to schedule processes onto different pro- 
cessors so as to minimize conflicts between processors. 

In yet still a further optimized embodiment, no rollback is 
performed in response to a speculative load. Instead, one of 
a number of possible values (such as the value stored in the 
main memory, or the value stored by an MRRB) is returned 
in response to a speculative load. Later disambiguation then 
reads an authoritative value and performs appropriate repair 
actions. Such disambiguation may include, for example, a 
load-verify operation as described in U.S. Pat. No. 5,758, 
051, entitled "Method and Apparatus for Reordering 
Memory Operations in a Processor", issued on May 26, 
1998, assigned to the assignee herein, the disclosure of 
which is incorporated herein by reference. 
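
The speculative-load embodiment can be illustrated by the following minimal sketch. It is not the load-verify operation of the referenced patent. The function names, the dictionary-based memory, and the policy of preferring a value buffered in an MRRB over the value in main memory are all assumptions chosen for the example.

```python
def speculative_load(address, main_memory, mrrb_values):
    """Return a provisional value without triggering a rollback.

    Assumed policy: prefer a value buffered in an MRRB, else main memory.
    """
    if address in mrrb_values:
        return mrrb_values[address]
    return main_memory[address]


def load_verify(address, provisional, main_memory):
    """Re-read the authoritative value and report whether repair is needed."""
    authoritative = main_memory[address]
    needs_repair = authoritative != provisional
    return authoritative, needs_repair


main_memory = {0x10: 1}
mrrb_values = {0x10: 7}          # uncommitted store held by another processor
guess = speculative_load(0x10, main_memory, mrrb_values)

# Suppose the buffered store is later discarded, so memory still holds 1.
value, needs_repair = load_verify(0x10, guess, main_memory)
print(value, needs_repair)
```

When `needs_repair` is true, the consumer of the provisional value would perform its repair actions using the authoritative value.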

The memory model implemented by the present invention 
ensures that loads and stores are consistent with respect to 
any given memory location, and in-order with respect to the 
memory subsystem. Store operations to different addresses 
may appear to be out-of-order at other processors. In an 
optimized embodiment, a total ordering of store operations 
can be achieved by requesting the memory locations in the 
original order of operations. 

In another optimized embodiment, the register renaming 
buffer and the memory reorder and rename buffer (MRRB) 
may be implemented as a single unit. Moreover, the memory 
reorder and rename buffer (MRRB) may be implemented 
using a gated store buffer. 
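
A gated store buffer of this kind can be modeled, under simplifying assumptions, as shown below. The class `GatedStoreBuffer` and its methods are hypothetical. The sketch assumes stores are entered in program order with increasing sequence numbers, and it uses a dictionary to stand in for the architected memory state.

```python
class GatedStoreBuffer:
    """Hypothetical model of an MRRB built as a gated store buffer."""

    def __init__(self, memory):
        self.memory = memory  # stands in for the architected memory state
        self.entries = []     # (sequence number, address, value), program order

    def store(self, seq, address, value):
        """Buffer a gated store; memory is not modified yet."""
        self.entries.append((seq, address, value))

    def load(self, seq, address):
        """Return the youngest buffered store older than the load, else memory."""
        for entry_seq, entry_addr, value in reversed(self.entries):
            if entry_addr == address and entry_seq < seq:
                return value
        return self.memory[address]

    def commit_through(self, seq):
        """Open the gate: write stores up to seq to memory in program order."""
        while self.entries and self.entries[0][0] <= seq:
            _, address, value = self.entries.pop(0)
            self.memory[address] = value

    def holds(self, address):
        """True if an uncommitted store to address is buffered."""
        return any(addr == address for _, addr, _ in self.entries)

    def discard_from(self, address):
        """Rollback: drop the oldest store to address and all later stores."""
        for index, (_, addr, _) in enumerate(self.entries):
            if addr == address:
                del self.entries[index:]
                return


memory = {0x20: 0, 0x24: 0}
buffer = GatedStoreBuffer(memory)
buffer.store(1, 0x20, 5)
buffer.store(2, 0x24, 9)
print(buffer.load(3, 0x20), memory[0x20])  # forwarded value vs. memory
buffer.commit_through(1)                   # store 1 is now in-order
if buffer.holds(0x24):                     # remote request hits the buffer
    buffer.discard_from(0x24)              # error condition: roll back
```

The `holds` check corresponds to detecting that a request from another cache refers to uncommitted data, and `discard_from` models discarding the given datum and all data stored thereafter.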

Although the illustrative embodiments have been 
described herein with reference to the accompanying 
drawings, it is to be understood that the present system and 
method is not limited to those precise embodiments, and that 
various other changes and modifications may be effected 
therein by one skilled in the art without departing from the 
scope or spirit of the invention. All such changes and 
modifications are intended to be included within the scope of 
the invention as defined by the appended claims. 

What is claimed is: 

1. A method for reordering and renaming memory refer- 
ences in a multiprocessor computer system having at least a 
first and a second processor, the first processor having a first 
private cache and a first buffer, the second processor having 
a second private cache and a second buffer, said method 
comprising the steps of: 

for each of a plurality of gated store requests received by 
the first processor to store a datum, 
exclusively acquiring a cache line that contains the 
datum by the first private cache, and 
storing the datum in the first buffer; 
upon the first buffer receiving a load request from the first 
processor to load a particular datum, 
providing the particular datum to the first processor 
from among the data stored in the first buffer based 
on an in-order sequence of load and store operations; 
and 
upon the first cache receiving a load request from the 
second cache for a given datum, 
indicating an error condition and resetting a current 
state of at least one of the processors to an earlier 
state when the load request for the given datum 
corresponds to the data stored in the first buffer. 

2. The method according to claim 1, further comprising 
the step of committing at least some of the data in the first 
buffer to an architected memory state of the computer 
system, prior to said indicating step, to remove the at least 
some of the data from the first buffer, and wherein said 
indicating step is performed only when the given datum in 
the first buffer is not committed. 

3. The method according to claim 2, wherein said committing 
step commits a specified datum to the architected 
memory state of the computer system when a gated store 
request corresponding to the specified datum is in-order with 
respect to all instructions that precede the gated store 
request. 

4. The method according to claim 1, wherein said resetting 
step comprises the step of discarding at least some of the 
data in the first buffer. 

5. The method according to claim 1, wherein said reset- 
ting step comprises the step of discarding the given datum 
from the first buffer and all data stored thereafter. 

6. The method according to claim 1, further comprising 
the step of releasing the cache line when operations referring 
to the cache line have completed execution in-order. 

7. The method according to claim 1, further comprising 
the step of releasing the cache line, when the datum contained 
within the cache line is committed to an architected 
memory state of the computer system in-order or when the 
datum is discarded from the first buffer. 

8. The method according to claim 1, wherein the earlier 
state corresponds to an operation immediately preceding the 
gated store request that stored the given datum in the first 
buffer. 

9. The method according to claim 1, further comprising 
the step of generating a snapshot of the earlier state. 

10. The method according to claim 9, wherein said 
generating step comprises one of the steps of copying 
contents of registers corresponding to the earlier state, and 
maintaining a record of incremental state changes from at 
least one state preceding the earlier state up to the earlier 
state. 

11. The method according to claim 1, further comprising 
the step of storing a snapshot of the earlier state in the first 
buffer. 

12. The method according to claim 1, further comprising 
the steps of: 

storing a snapshot of the earlier state in one of the first 
processor, the second processor, and a storage device 
external thereto; and 
storing a timestamp corresponding to the snapshot of the 
earlier state in the first buffer in association with the 
given datum. 

13. The method according to claim 12, wherein said 
resetting step comprises the step of searching for the timestamp 
in the first buffer to identify the snapshot from among 
a plurality of snapshots stored in one of the first processor, 
the second processor, and the storage device external 
thereto. 

14. The method according to claim 1, further comprising 
the step of processing the store and load requests in-order 
and suspending said steps involving the first buffer, upon 
performing a predetermined number of resetting steps. 

15. The method according to claim 1, further comprising 
the steps of: 

when the load request is a speculative load request: 
suspending said indicating and resetting steps; 
providing, to the second cache in response to the load 
request, a specified datum from among the data 
stored in at least one of the first buffer, the second 
buffer, and a main memory of the computer system 
that is shared between the processors; and 
providing an authoritative datum to the second cache 
and performing corresponding repair operations with 
respect to the specified datum previously provided to 
the second cache, upon disambiguation of the speculative 
load request. 

16. The method according to claim 15, further comprising 
the steps of: 

determining whether the authoritative datum is the speci- 
fied datum; and 

resetting a processor in-order state and performing repair 
actions, when the authoritative datum is the specified 
datum. 

17. An apparatus for reordering and renaming memory 
references in a multiprocessor computer system having a 
plurality of processors, comprising: 

a plurality of buffers, each buffer corresponding to one of 
the processors, and adapted to store a datum when a 
corresponding processor receives a gated store request 
to store the datum, and to provide a particular datum to 
the corresponding processor from among the stored 
data based on an in-order sequence of load and store 
operations when a load request is received from the 
corresponding processor to load the particular datum; 

a plurality of private caches, each cache corresponding to 
one of the processors, and adapted to exclusively 
acquire a cache line containing the datum when the 
corresponding processor receives the gated store 
request to store the datum; 

an indicator adapted to indicate an error condition when 
a load request for a given datum corresponds to the data 
in one of the plurality of buffers, the load request 
received by the cache of the processor corresponding to 
the one of the plurality of buffers; and 




a reset device adapted to reset a current state of at least 
one of the processors to an earlier state when the load 
request for the given datum corresponds to the data in 
the one of the plurality of buffers. 

18. The apparatus according to claim 17, further comprising 
commitment logic adapted to commit at least some 
of the data in the buffers to an architected memory state of 
the computer system to remove the at least some of the data 
from the buffers, and wherein said indicator indicates the 
error condition only when the given datum is not committed. 

19. The apparatus according to claim 18, wherein said 
commitment logic commits the given datum to the architected 
memory state of the computer system when a given 
gated store request corresponding to the given datum is 
in-order with respect to all instructions that precede the 
given gated store request. 

20. The apparatus according to claim 17, wherein said 
reset device is further adapted to discard at least some of the 
data in the buffers. 

21. The apparatus according to claim 17, wherein said 
reset device is further adapted to discard the given datum 
from the one of the plurality of buffers and all data stored 
thereafter. 

22. The apparatus according to claim 17, wherein each 
cache is further adapted to release the cache line when 
operations referring to the cache line have completed execution 
in-order. 

23. The apparatus according to claim 17, wherein each 
cache is further adapted to release the cache line, when the 
datum contained within the cache line is committed to an 
architected memory state of the computer system in-order or 
when the datum is discarded from one of the buffers. 

24. The apparatus according to claim 17, wherein the 
earlier state corresponds to an operation immediately preceding 
the gated store request that stored the given datum in 
one of the buffers. 

25. The apparatus according to claim 17, wherein each of 
the buffers is further adapted to store a snapshot of the earlier 
state. 
